Executive summary

Using a data sample collected at Tongji Hospital in Wuhan between January 10 and February 18, 2020, this study identified the parameters most relevant for estimating the probability of death of a patient suspected of having COVID-19. The values of the following blood indicators were most strongly correlated with the probability of death:

Patient age is also correlated with survival. However, correlation does not imply causation, so age was excluded from the analysis.

The analysis confirms the conclusions about key parameters reported by Li Yan et al. in the article "An interpretable mortality prediction model for COVID-19 patients". The data presented in this report also confirm the importance of lactate dehydrogenase, high-sensitivity C-reactive protein, and lymphocyte levels.



R libraries used

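# Packages called by the code in this report (all appear as attached in the output below)
library(readxl)   # read_excel()
library(ggplot2)  # ggplot()
library(plotly)   # ggplotly()
library(caret)    # createDataPartition(), trainControl()
sessionInfo()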
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250   
## [3] LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C                  
## [5] LC_TIME=Polish_Poland.1250    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] caret_6.0-88    lattice_0.20-44 plotly_4.9.3    ggplot2_3.3.3  
## [5] rmarkdown_2.8   knitr_1.33      zoo_1.8-9       readxl_1.3.1   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.6           lubridate_1.7.10     tidyr_1.1.3         
##  [4] class_7.3-19         digest_0.6.27        ipred_0.9-11        
##  [7] foreach_1.5.1        utf8_1.2.1           R6_2.5.0            
## [10] cellranger_1.1.0     plyr_1.8.6           stats4_4.1.0        
## [13] evaluate_0.14        httr_1.4.2           pillar_1.6.1        
## [16] rlang_0.4.11         lazyeval_0.2.2       data.table_1.14.0   
## [19] rpart_4.1-15         Matrix_1.3-3         splines_4.1.0       
## [22] gower_0.2.2          stringr_1.4.0        htmlwidgets_1.5.3   
## [25] munsell_0.5.0        compiler_4.1.0       xfun_0.23           
## [28] pkgconfig_2.0.3      htmltools_0.5.1.1    nnet_7.3-16         
## [31] tidyselect_1.1.1     tibble_3.1.2         prodlim_2019.11.13  
## [34] codetools_0.2-18     fansi_0.5.0          viridisLite_0.4.0   
## [37] crayon_1.4.1         dplyr_1.0.6          withr_2.4.2         
## [40] MASS_7.3-54          recipes_0.1.16       ModelMetrics_1.2.2.2
## [43] grid_4.1.0           nlme_3.1-152         jsonlite_1.7.2      
## [46] gtable_0.3.0         lifecycle_1.0.0      magrittr_2.0.1      
## [49] pROC_1.17.0.1        scales_1.1.1         stringi_1.6.1       
## [52] reshape2_1.4.4       timeDate_3043.102    ellipsis_0.3.2      
## [55] generics_0.1.0       vctrs_0.3.8          lava_1.6.9          
## [58] iterators_1.0.13     tools_4.1.0          glue_1.4.2          
## [61] purrr_0.3.4          survival_3.2-11      yaml_2.2.1          
## [64] colorspace_2.0-1

Code ensuring repeatability of results each time the report runs on the same data

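rm(list=ls())
set.seed(7)  # fix the random seed so that random steps, such as the train/validation split, repeat exactly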

Code loading data from the input file

setwd("C:/Users/adamc/OneDrive/Desktop/Studia Podyplomowe/Projekt R")

coviddata <- read_excel("wuhan_blood_sample_data_Jan_Feb_2020.xlsx")
## New names:
## * `` -> ...1

Data cleansing code

colnames(coviddata)[1] <- "Patient ID"
colnames(coviddata)[2] <- "Date of entry"
colnames(coviddata)[3] <- "Age"
colnames(coviddata)[4] <- "Gender"
colnames(coviddata)[7] <- "Outcome"
colnames(coviddata)[9] <- "Hemoglobin"
colnames(coviddata)[12] <- "Procalcitonin"
colnames(coviddata)[13] <- "Eosinophils"
colnames(coviddata)[16] <- "Albumin"
colnames(coviddata)[17] <- "Basophil"
colnames(coviddata)[21] <- "Monocytes"
colnames(coviddata)[22] <- "Antithrombin"
colnames(coviddata)[24] <- "Indirect bilirubin"
colnames(coviddata)[26] <- "Neutrophils"
colnames(coviddata)[27] <- "Total protein"
colnames(coviddata)[31] <- "Mean corpuscular volume"
colnames(coviddata)[32] <- "Hematocrit"
colnames(coviddata)[33] <- "White blood cell count"
colnames(coviddata)[34] <- "Tumor necrosis factor alpha"
colnames(coviddata)[35] <- "Mean corpuscular hemoglobin concentration"
colnames(coviddata)[36] <- "Fibrinogen"
colnames(coviddata)[39] <- "Lymphocyte count"
colnames(coviddata)[45] <- "Glucose"
colnames(coviddata)[46] <- "Neutrophils count"
colnames(coviddata)[49] <- "Ferritin"
colnames(coviddata)[52] <- "Lymphocyte"
colnames(coviddata)[56] <- "Aspartate aminotransferase"
colnames(coviddata)[59] <- "Calcium"
colnames(coviddata)[62] <- "Platelet large cell ratio"
colnames(coviddata)[65] <- "Monocytes  count"
colnames(coviddata)[67] <- "Globuline"
colnames(coviddata)[68] <- "Gamma-glutamyl transpeptidase"
colnames(coviddata)[70] <- "Basophil count"
colnames(coviddata)[72] <- "Mean corpuscular hemoglobin"
colnames(coviddata)[76] <- "Serum sodium"
colnames(coviddata)[77] <- "Thrombocytocrit"
colnames(coviddata)[79] <- "Glutamic-pyruvid transaminase"
colnames(coviddata)[81] <- "Creatinine"
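The one-assignment-per-column renaming above can also be expressed as a single position-to-name lookup. The following is only a minimal sketch on a small made-up data frame; `demo`, its original column names, and `new_names` are hypothetical and not taken from the input file.

```r
# Hypothetical stand-in for the imported sheet; the real column names differ.
demo <- data.frame(a = 1:2, b = c(50, 61), c = c(1, 2), d = c(0, 1))

# Map column position -> new name, then rename all listed columns in one step.
new_names <- c("1" = "Patient ID", "2" = "Age", "4" = "Outcome")
positions <- as.integer(names(new_names))
colnames(demo)[positions] <- unname(new_names)

print(colnames(demo))
```

Columns whose positions are not listed (here the third one) keep their original names, which matches the behaviour of the individual assignments.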

Section summarizing dataset size and basic statistics

print(paste("Number of gathered inputs:",nrow(coviddata),sep=" "))
## [1] "Number of gathered inputs: 6120"
print(paste("Number of analyzed patients:",max(coviddata$`Patient ID`),sep=" "))
## [1] "Number of analyzed patients: 375"
print(paste("Lowest patients' age:",min(coviddata$Age),sep=" "))
## [1] "Lowest patients' age: 18"
print(paste("Mean patients' age:",mean(coviddata$Age),sep=" "))
## [1] "Mean patients' age: 59.4433006535948"
print(paste("Highest patients' age:",max(coviddata$Age),sep=" "))
## [1] "Highest patients' age: 95"
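With 6120 rows for 375 patients, each patient contributes several rows. Two consequences follow: `max()` of the ID equals the patient count only when the IDs run consecutively from 1, and the row-level mean age weights each patient by the number of measurements. The sketch below uses made-up data (`toy` and its values are hypothetical) to show counting distinct IDs and averaging age once per patient.

```r
# Hypothetical long-format data: several measurement rows per patient.
toy <- data.frame(
  `Patient ID` = c(1, 1, 1, 2, 3, 3),
  Age          = c(30, 30, 30, 60, 90, 90),
  check.names  = FALSE
)

# Count distinct patients instead of relying on the largest ID.
n_patients <- length(unique(toy$`Patient ID`))

# Keep the first row of each patient so that every patient counts once.
per_patient <- toy[!duplicated(toy$`Patient ID`), ]

row_mean     <- mean(toy$Age)          # weighted by the number of rows per patient
patient_mean <- mean(per_patient$Age)  # one value per patient

print(c(patients = n_patients, row_mean = row_mean, patient_mean = patient_mean))
```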

Analysis of attribute values

## [1] "Patients age"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   47.00   62.00   59.44   71.00   95.00

##  Hypersensitive cardiac troponinI   Hemoglobin    Serum chloride  
##  Min.   :    1.9                  Min.   :  6.4   Min.   : 71.50  
##  1st Qu.:    4.4                  1st Qu.:113.0   1st Qu.: 99.05  
##  Median :   20.6                  Median :125.0   Median :102.10  
##  Mean   : 1223.2                  Mean   :123.1   Mean   :103.14  
##  3rd Qu.:  223.8                  3rd Qu.:137.0   3rd Qu.:105.65  
##  Max.   :50000.0                  Max.   :178.0   Max.   :140.40  
##  NA's   :5613                     NA's   :5145    NA's   :5145    
##  Prothrombin time Procalcitonin     Eosinophils    Interleukin 2 receptor
##  Min.   : 11.50   Min.   : 0.020   Min.   :0.000   Min.   :  61.0        
##  1st Qu.: 13.60   1st Qu.: 0.040   1st Qu.:0.000   1st Qu.: 459.5        
##  Median : 14.80   Median : 0.100   Median :0.100   Median : 676.5        
##  Mean   : 16.68   Mean   : 1.107   Mean   :0.629   Mean   : 907.2        
##  3rd Qu.: 16.70   3rd Qu.: 0.405   3rd Qu.:0.800   3rd Qu.:1155.5        
##  Max.   :120.00   Max.   :57.170   Max.   :8.600   Max.   :7500.0        
##  NA's   :5458     NA's   :5661     NA's   :5163    NA's   :5852          
##  Alkaline phosphatase    Albumin         Basophil    Interleukin 10   
##  Min.   : 17.00       Min.   :13.60   Min.   :0.00   Min.   :   5.00  
##  1st Qu.: 54.00       1st Qu.:27.40   1st Qu.:0.10   1st Qu.:   5.00  
##  Median : 69.50       Median :32.20   Median :0.20   Median :   5.90  
##  Mean   : 82.47       Mean   :32.01   Mean   :0.21   Mean   :  16.07  
##  3rd Qu.: 95.00       3rd Qu.:36.60   3rd Qu.:0.30   3rd Qu.:  12.35  
##  Max.   :620.00       Max.   :48.60   Max.   :1.70   Max.   :1000.00  
##  NA's   :5190         NA's   :5186    NA's   :5163   NA's   :5853     
##  Total bilirubin  Platelet count    Monocytes       Antithrombin   
##  Min.   :  2.50   Min.   : -1.0   Min.   : 0.300   Min.   : 20.00  
##  1st Qu.:  7.40   1st Qu.:109.0   1st Qu.: 2.800   1st Qu.: 74.00  
##  Median : 10.70   Median :178.0   Median : 5.700   Median : 86.00  
##  Mean   : 16.70   Mean   :184.3   Mean   : 6.155   Mean   : 85.32  
##  3rd Qu.: 16.77   3rd Qu.:248.0   3rd Qu.: 8.600   3rd Qu.: 97.00  
##  Max.   :505.70   Max.   :558.0   Max.   :53.000   Max.   :136.00  
##  NA's   :5190     NA's   :5163    NA's   :5162     NA's   :5790    
##  Interleukin 8      Indirect bilirubin Red blood cell distribution width
##  Min.   :   5.000   Min.   :  0.100    Min.   :10.60                    
##  1st Qu.:   8.675   1st Qu.:  3.800    1st Qu.:12.00                    
##  Median :  16.000   Median :  5.400    Median :12.60                    
##  Mean   :  83.088   Mean   :  6.889    Mean   :13.07                    
##  3rd Qu.:  35.200   3rd Qu.:  8.000    3rd Qu.:13.70                    
##  Max.   :6795.000   Max.   :145.100    Max.   :27.10                    
##  NA's   :5852       NA's   :5214       NA's   :5197                     
##   Neutrophils   Total protein   Quantification of Treponema pallidum antibodies
##  Min.   : 1.7   Min.   :31.80   Min.   : 0.020                                 
##  1st Qu.:65.1   1st Qu.:61.00   1st Qu.: 0.040                                 
##  Median :82.4   Median :65.90   Median : 0.050                                 
##  Mean   :77.6   Mean   :65.30   Mean   : 0.132                                 
##  3rd Qu.:92.3   3rd Qu.:70.45   3rd Qu.: 0.070                                 
##  Max.   :98.9   Max.   :88.70   Max.   :11.950                                 
##  NA's   :5163   NA's   :5189    NA's   :5841                                   
##  Prothrombin activity     HBsAg         Mean corpuscular volume   Hematocrit   
##  Min.   :  6.00       Min.   :  0.000   Min.   : 61.60          Min.   :14.50  
##  1st Qu.: 65.00       1st Qu.:  0.000   1st Qu.: 86.90          1st Qu.:33.50  
##  Median : 81.00       Median :  0.010   Median : 90.10          Median :36.60  
##  Mean   : 78.55       Mean   :  8.306   Mean   : 90.39          Mean   :36.55  
##  3rd Qu.: 95.00       3rd Qu.:  0.010   3rd Qu.: 93.90          3rd Qu.:39.90  
##  Max.   :142.00       Max.   :250.000   Max.   :118.90          Max.   :52.30  
##  NA's   :5461         NA's   :5841      NA's   :5163            NA's   :5163   
##  White blood cell count Tumor necrosis factor alpha
##  Min.   :   0.13        Min.   :  4.00             
##  1st Qu.:   4.94        1st Qu.:  6.70             
##  Median :   7.72        Median :  8.60             
##  Mean   :  15.60        Mean   : 11.58             
##  3rd Qu.:  12.72        3rd Qu.: 11.50             
##  Max.   :1726.60        Max.   :168.00             
##  NA's   :4993           NA's   :5852               
##  Mean corpuscular hemoglobin concentration   Fibrinogen     Interleukin 1β 
##  Min.   :286.0                             Min.   : 0.500   Min.   : 5.00  
##  1st Qu.:333.0                             1st Qu.: 3.050   1st Qu.: 5.00  
##  Median :343.0                             Median : 4.120   Median : 5.00  
##  Mean   :342.8                             Mean   : 4.294   Mean   : 6.51  
##  3rd Qu.:350.0                             3rd Qu.: 5.480   3rd Qu.: 5.00  
##  Max.   :514.0                             Max.   :10.780   Max.   :88.50  
##  NA's   :5163                              NA's   :5554     NA's   :5852   
##       Urea        Lymphocyte count    PH value     Red blood cell count
##  Min.   : 0.800   Min.   : 0.000   Min.   :5.000   Min.   :  0.100     
##  1st Qu.: 4.000   1st Qu.: 0.460   1st Qu.:6.000   1st Qu.:  3.680     
##  Median : 5.985   Median : 0.800   Median :6.500   Median :  4.140     
##  Mean   : 9.589   Mean   : 1.017   Mean   :6.484   Mean   :  9.288     
##  3rd Qu.:11.400   3rd Qu.: 1.310   3rd Qu.:7.294   3rd Qu.:  4.650     
##  Max.   :68.400   Max.   :52.420   Max.   :7.565   Max.   :749.500     
##  NA's   :5184     NA's   :5163     NA's   :5736    NA's   :4993        
##  Eosinophil count Corrected calcium Serum potassium     Glucose      
##  Min.   :0.000    Min.   :1.650     Min.   : 2.760   Min.   : 1.000  
##  1st Qu.:0.000    1st Qu.:2.270     1st Qu.: 3.950   1st Qu.: 5.550  
##  Median :0.010    Median :2.360     Median : 4.410   Median : 6.990  
##  Mean   :0.039    Mean   :2.355     Mean   : 4.509   Mean   : 8.889  
##  3rd Qu.:0.060    3rd Qu.:2.440     3rd Qu.: 4.870   3rd Qu.:10.260  
##  Max.   :0.490    Max.   :2.790     Max.   :12.800   Max.   :43.010  
##  NA's   :5163     NA's   :5206      NA's   :5140     NA's   :5345    
##  Neutrophils count Direct bilirubin  Mean platelet volume    Ferritin      
##  Min.   : 0.06     Min.   :  1.600   Min.   : 8.50        Min.   :   17.8  
##  1st Qu.: 3.09     1st Qu.:  3.225   1st Qu.:10.10        1st Qu.:  377.2  
##  Median : 5.85     Median :  4.800   Median :10.80        Median :  711.0  
##  Mean   : 7.81     Mean   :  9.887   Mean   :10.91        Mean   : 1379.1  
##  3rd Qu.:10.95     3rd Qu.:  8.275   3rd Qu.:11.50        3rd Qu.: 1425.2  
##  Max.   :33.88     Max.   :360.600   Max.   :15.00        Max.   :50000.0  
##  NA's   :5163      NA's   :5190      NA's   :5258         NA's   :5837     
##  RBC distribution width SD Thrombin time      Lymphocyte    
##  Min.   : 31.30            Min.   : 13.00   Min.   : 0.000  
##  1st Qu.: 38.50            1st Qu.: 15.60   1st Qu.: 3.925  
##  Median : 40.90            Median : 16.80   Median :11.450  
##  Mean   : 42.44            Mean   : 18.17   Mean   :15.392  
##  3rd Qu.: 44.70            3rd Qu.: 18.38   3rd Qu.:24.975  
##  Max.   :113.30            Max.   :161.90   Max.   :60.000  
##  NA's   :5197              NA's   :5554     NA's   :5162    
##  HCV antibody quantification   D-D dimer      Total cholesterol
##  Min.   :0.020               Min.   : 0.210   Min.   :0.100    
##  1st Qu.:0.040               1st Qu.: 0.603   1st Qu.:3.010    
##  Median :0.060               Median : 2.155   Median :3.630    
##  Mean   :0.117               Mean   : 7.943   Mean   :3.689    
##  3rd Qu.:0.090               3rd Qu.:21.000   3rd Qu.:4.265    
##  Max.   :2.090               Max.   :60.000   Max.   :7.300    
##  NA's   :5841                NA's   :5490     NA's   :5189     
##  Aspartate aminotransferase   Uric acid          HCO3-          Calcium     
##  Min.   :   6.00            Min.   :  43.0   Min.   : 6.30   Min.   :1.170  
##  1st Qu.:  19.50            1st Qu.: 183.2   1st Qu.:21.00   1st Qu.:1.980  
##  Median :  27.00            Median : 243.7   Median :23.50   Median :2.080  
##  Mean   :  46.53            Mean   : 276.1   Mean   :23.14   Mean   :2.078  
##  3rd Qu.:  42.00            3rd Qu.: 333.8   3rd Qu.:25.90   3rd Qu.:2.190  
##  Max.   :1858.00            Max.   :1176.0   Max.   :36.30   Max.   :2.620  
##  NA's   :5185               NA's   :5186     NA's   :5186    NA's   :5141   
##  Amino-terminal brain natriuretic peptide precursor(NT-proBNP)
##  Min.   :    5                                                
##  1st Qu.:  150                                                
##  Median :  585                                                
##  Mean   : 3669                                                
##  3rd Qu.: 2625                                                
##  Max.   :70000                                                
##  NA's   :5645                                                 
##  Lactate dehydrogenase Platelet large cell ratio Interleukin 6     
##  Min.   : 110.0        Min.   :11.20             Min.   :   1.500  
##  1st Qu.: 218.0        1st Qu.:25.60             1st Qu.:   4.772  
##  Median : 340.0        Median :30.90             Median :  19.265  
##  Mean   : 474.2        Mean   :31.77             Mean   : 112.308  
##  3rd Qu.: 601.8        3rd Qu.:37.20             3rd Qu.:  60.167  
##  Max.   :1867.0        Max.   :62.20             Max.   :5000.000  
##  NA's   :5186          NA's   :5258              NA's   :5848      
##  Fibrin degradation products Monocytes  count PLT distribution width
##  Min.   :  4.00              Min.   : 0.010   Min.   : 8.00         
##  1st Qu.:  4.00              1st Qu.: 0.270   1st Qu.:11.10         
##  Median : 17.90              Median : 0.410   Median :12.40         
##  Mean   : 61.35              Mean   : 0.526   Mean   :13.01         
##  3rd Qu.:150.00              3rd Qu.: 0.580   3rd Qu.:14.30         
##  Max.   :190.80              Max.   :39.920   Max.   :25.30         
##  NA's   :5790                NA's   :5163     NA's   :5258          
##    Globuline     Gamma-glutamyl transpeptidase International standard ratio
##  Min.   :10.10   Min.   :  3.00                Min.   : 0.840              
##  1st Qu.:29.70   1st Qu.: 22.00                1st Qu.: 1.030              
##  Median :32.70   Median : 34.00                Median : 1.140              
##  Mean   :33.24   Mean   : 55.34                Mean   : 1.313              
##  3rd Qu.:36.50   3rd Qu.: 58.00                3rd Qu.: 1.330              
##  Max.   :50.60   Max.   :732.00                Max.   :13.480              
##  NA's   :5190    NA's   :5190                  NA's   :5461                
##  Basophil count  2019-nCoV nucleic acid detection Mean corpuscular hemoglobin
##  Min.   :0.000   Min.   :-1                       Min.   :20.4               
##  1st Qu.:0.010   1st Qu.:-1                       1st Qu.:29.7               
##  Median :0.010   Median :-1                       Median :30.9               
##  Mean   :0.017   Mean   :-1                       Mean   :31.0               
##  3rd Qu.:0.020   3rd Qu.:-1                       3rd Qu.:32.2               
##  Max.   :0.120   Max.   :-1                       Max.   :50.8               
##  NA's   :5163    NA's   :5619                     NA's   :5163               
##  Activation of partial thromboplastin time High sensitivity C-reactive protein
##  Min.   : 21.80                            Min.   :  0.10                     
##  1st Qu.: 35.30                            1st Qu.:  5.70                     
##  Median : 39.20                            Median : 51.50                     
##  Mean   : 41.52                            Mean   : 76.24                     
##  3rd Qu.: 44.12                            3rd Qu.:118.50                     
##  Max.   :144.00                            Max.   :320.00                     
##  NA's   :5552                              NA's   :5383                       
##  HIV antibody quantification  Serum sodium   Thrombocytocrit      ESR        
##  Min.   :0.05                Min.   :115.4   Min.   :0.010   Min.   :  1.00  
##  1st Qu.:0.07                1st Qu.:137.7   1st Qu.:0.150   1st Qu.: 14.00  
##  Median :0.09                Median :140.4   Median :0.210   Median : 28.00  
##  Mean   :0.10                Mean   :141.6   Mean   :0.212   Mean   : 33.69  
##  3rd Qu.:0.11                3rd Qu.:143.5   3rd Qu.:0.270   3rd Qu.: 45.50  
##  Max.   :0.27                Max.   :179.7   Max.   :0.510   Max.   :110.00  
##  NA's   :5842                NA's   :5145    NA's   :5258    NA's   :5737    
##  Glutamic-pyruvid transaminase      eGFR          Creatinine     
##  Min.   :   5.00               Min.   :  2.00   Min.   :  11.00  
##  1st Qu.:  16.00               1st Qu.: 63.58   1st Qu.:  58.00  
##  Median :  24.00               Median : 87.90   Median :  76.00  
##  Mean   :  38.86               Mean   : 81.56   Mean   : 109.93  
##  3rd Qu.:  41.00               3rd Qu.:103.97   3rd Qu.:  98.25  
##  Max.   :1600.00               Max.   :224.00   Max.   :1497.00  
##  NA's   :5189                  NA's   :5184     NA's   :5184

The correlation check section

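# Absolute Pearson correlation between the outcome (column 7) and each indicator (columns 8 to 81)
pearsoncor <- matrix(data=NA, nrow=0, ncol=2)
outcome <- dplyr::pull(coviddata,7)

for(i in 8:81){
  analyzed_data <- dplyr::pull(coviddata,i)
  currentfactor <- colnames(coviddata)[i]
  corvalue <- cor.test(outcome, analyzed_data)$estimate
  # c() coerces the pair to character, so the matrix stores the values as strings
  result <- c(currentfactor, abs(corvalue))
  pearsoncor <- rbind(pearsoncor, result)
}
# Convert back to numeric before ordering so the values are ranked by size, not as text
pearsoncor <- pearsoncor[order(as.numeric(pearsoncor[,2]), decreasing=TRUE),]

top_values <- as.numeric(pearsoncor[1:12,2])
plot(top_values, main="Most important factors predicting outcome", ylab="Absolute Pearson correlation")

text(top_values, labels=pearsoncor[1:12,1], cex=0.7)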
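The ranking step can be illustrated on synthetic data. This is only a sketch: the data frame `indicators`, its columns, and the effect sizes are invented, and it uses `cor()` with `use = "complete.obs"` in place of `cor.test()`, which likewise drops incomplete pairs before computing the Pearson estimate.

```r
set.seed(1)  # arbitrary seed so the made-up data are reproducible

# Hypothetical data: a binary outcome and three numeric indicators.
n <- 200
outcome <- rbinom(n, 1, 0.5)
indicators <- data.frame(
  strong  = outcome * 2 + rnorm(n, sd = 0.5),   # positively related to the outcome
  inverse = -outcome * 2 + rnorm(n, sd = 0.5),  # negatively related to the outcome
  noise   = rnorm(n)                            # unrelated to the outcome
)
indicators$strong[sample(n, 20)] <- NA          # simulate missing measurements

# Absolute Pearson correlation per column, using only complete pairs.
abs_cor <- sapply(indicators, function(x) abs(cor(outcome, x, use = "complete.obs")))

# abs_cor is a numeric vector, so sort() orders by value rather than by string.
ranking <- sort(abs_cor, decreasing = TRUE)
print(round(ranking, 2))
```

Taking the absolute value places strongly negative relationships next to strongly positive ones, which is why an indicator that falls in non-survivors can rank as high as one that rises.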

# Histogram of one indicator with bars outlined by outcome (0 = survived, 1 = died).
# Columns are referenced by name inside aes() instead of through coviddata$...
indicator_histogram <- function(column, label = column) {
  ggplot(coviddata, aes(x = .data[[column]], color = factor(Outcome))) +
    geom_histogram(binwidth = 1, fill = "beige") +
    xlab(label) + ylab("Count") +
    ggtitle(paste("Impact of", label, "on Outcome")) +
    scale_color_manual(labels = c("Survived", "Died"), values = c("darkgreen", "red")) +
    labs(color = "Outcome")
}

p <- indicator_histogram("Neutrophils")
plot(p)

p2 <- indicator_histogram("Lymphocyte")
plot(p2)

p3 <- indicator_histogram("Albumin")
plot(p3)

p4 <- indicator_histogram("Prothrombin activity", "Prothrombin Activity")
plot(p4)

p5 <- indicator_histogram("High sensitivity C-reactive protein")
plot(p5)

p6 <- indicator_histogram("D-D dimer")
plot(p6)

p7 <- indicator_histogram("Lactate dehydrogenase")
plot(p7)

p8 <- indicator_histogram("Neutrophils count")
plot(p8)

p9 <- indicator_histogram("Fibrin degradation products")
plot(p9)

Interactive graph showing the change of selected attributes over time

coviddata$Gender[coviddata$Gender==1] <- "Male"
coviddata$Gender[coviddata$Gender==2] <- "Female"

#coviddata$Outcome[coviddata$Outcome==0] <- "Survived"
#coviddata$Outcome[coviddata$Outcome==1] <- "Died"

p10 <- ggplot(coviddata, aes(x=Age, y=Gender, color=Outcome)) + geom_point() + scale_color_distiller() + theme_classic() + theme(legend.title = element_blank())
ggplotly(p10)

Patient survival classifier

coviddata.training.indicies <- createDataPartition(coviddata$Outcome, p = 0.80, list = FALSE)
coviddata.training <- coviddata[coviddata.training.indicies,]
coviddata.validation <- coviddata[-coviddata.training.indicies,]

control <- trainControl(method="cv", number=10)
metric <- "Accuracy"

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
## 
##     select
#set.seed(7)
#fit.lda <- train(Outcome~., data=coviddata.training, method="lda", metric=metric, trControl=control )

#set.seed(7)
#fit.cart <- train(Outcome~., data=coviddata.training, method="rpart", metric=metric, trControl=control )

#set.seed(7)
#fit.knn <- train(Outcome~., data=coviddata.training, method="knn", metric=metric, trControl=control )

#set.seed(7)
#fit.svm <- train(Outcome~., data=coviddata.training, method="svmRadial", metric=metric, trControl=control )

#set.seed(7)
#fit.rf <- train(Outcome~., data=coviddata.training, method="rf", metric=metric, trControl=control )
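Since the model-fitting calls above are commented out, the report contains no fitted classifier. The sketch below shows the general workflow (split, fit, evaluate on held-out rows) on synthetic data only. It is not the report's model: it swaps in a base R logistic regression (`glm`) for the caret methods, uses a simple random split rather than `createDataPartition`, and the data frame `toy`, its columns, and the coefficients are invented for illustration.

```r
set.seed(7)  # same seed value as in the commented-out calls above

# Hypothetical data: two made-up indicators and a binary outcome (1 = died).
n <- 400
ldh   <- rnorm(n, mean = 400, sd = 150)
lymph <- rnorm(n, mean = 15, sd = 8)
risk  <- plogis(0.01 * (ldh - 400) - 0.15 * (lymph - 15))
toy   <- data.frame(ldh = ldh, lymph = lymph, outcome = rbinom(n, 1, risk))

# 80/20 split, mirroring p = 0.80 (simple random split, not stratified).
train_idx  <- sample(n, size = 0.8 * n)
training   <- toy[train_idx, ]
validation <- toy[-train_idx, ]

# Logistic regression as a stand-in classifier.
fit <- glm(outcome ~ ldh + lymph, data = training, family = binomial)

# Predict on the held-out rows and compute accuracy at a 0.5 threshold.
prob     <- predict(fit, newdata = validation, type = "response")
pred     <- as.integer(prob > 0.5)
accuracy <- mean(pred == validation$outcome)
print(accuracy)
```

When moving back to caret, note that `train()` treats a numeric outcome as a regression target, so classification with `metric = "Accuracy"` generally requires `Outcome` to be a factor, and rows with missing predictors need to be handled before fitting.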

Importance analysis of the attributes of the best model found

The correlation analysis conducted in this report indicates several factors with a major influence on the predicted patient outcome. The following parameters seem to be the most significant:

The results of the data analysis are consistent with the article "An interpretable mortality prediction model for COVID-19 patients". The outcome-prediction algorithm proposed there depends on three factors: lactate dehydrogenase, high-sensitivity C-reactive protein, and lymphocyte level.

All three parameters are among the most significant factors obtained during the analysis.